{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to DataFrames\n",
    "**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 23, 2018**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "using DataFrames # load package"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting basic information about a data frame\n",
    "\n",
    "Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr></thead><tbody><tr><th>1</th><td>1</td><td>1.0</td><td>a</td></tr><tr><th>2</th><td>2</td><td>missing</td><td>b</td></tr></tbody></table>"
      ],
      "text/plain": [
       "2×3 DataFrames.DataFrame\n",
       "│ Row │ A │ B       │ C │\n",
       "├─────┼───┼─────────┼───┤\n",
       "│ 1   │ 1 │ 1.0     │ a │\n",
       "│ 2   │ 2 │ \u001b[90mmissing\u001b[39m │ b │"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = DataFrame(A = [1, 2], B = [1.0, missing], C = [\"a\", \"b\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The standard `size` function works to get dimensions of the `DataFrame`,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((2, 3), 2, 3)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "size(x), size(x, 1), size(x, 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "as well as `nrow` and `ncol` from R; `length` gives number of columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2, 3, 3)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nrow(x), ncol(x), length(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`describe` gives basic summary statistics of data in your `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A\n",
      "Summary Stats:\n",
      "Mean:           1.500000\n",
      "Minimum:        1.000000\n",
      "1st Quartile:   1.250000\n",
      "Median:         1.500000\n",
      "3rd Quartile:   1.750000\n",
      "Maximum:        2.000000\n",
      "Length:         2\n",
      "Type:           Int64\n",
      "\n",
      "B\n",
      "Summary Stats:\n",
      "Mean:           1.000000\n",
      "Minimum:        1.000000\n",
      "1st Quartile:   1.000000\n",
      "Median:         1.000000\n",
      "3rd Quartile:   1.000000\n",
      "Maximum:        1.000000\n",
      "Length:         2\n",
      "Type:           Union{Float64, Missings.Missing}\n",
      "Number Missing: 1\n",
      "% Missing:      50.000000\n",
      "\n",
      "C\n",
      "Summary Stats:\n",
      "Length:         2\n",
      "Type:           String\n",
      "Number Unique:  2\n",
      "\n"
     ]
    }
   ],
   "source": [
    "describe(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `showcols` to get informaton about columns stored in a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2×3 DataFrames.DataFrame\n",
      "│ Col # │ Name │ Eltype                           │ Missing │ Values          │\n",
      "├───────┼──────┼──────────────────────────────────┼─────────┼─────────────────┤\n",
      "│ 1     │ A    │ Int64                            │ 0       │ 1  …  2         │\n",
      "│ 2     │ B    │ Union{Float64, Missings.Missing} │ 1       │ 1.0  …  missing │\n",
      "│ 3     │ C    │ String                           │ 0       │ a  …  b         │"
     ]
    }
   ],
   "source": [
    "showcols(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`names` will return the names of all columns,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3-element Array{Symbol,1}:\n",
       " :A\n",
       " :B\n",
       " :C"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "names(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and `eltypes` returns their types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3-element Array{Type,1}:\n",
       " Int64                           \n",
       " Union{Float64, Missings.Missing}\n",
       " String                          "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eltypes(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we create some large DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "y = DataFrame(rand(1:10, 1000, 10));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and then we can use `head` to peek into its top rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>x1</th><th>x2</th><th>x3</th><th>x4</th><th>x5</th><th>x6</th><th>x7</th><th>x8</th><th>x9</th><th>x10</th></tr></thead><tbody><tr><th>1</th><td>8</td><td>6</td><td>1</td><td>2</td><td>7</td><td>10</td><td>5</td><td>1</td><td>5</td><td>10</td></tr><tr><th>2</th><td>8</td><td>9</td><td>6</td><td>6</td><td>10</td><td>4</td><td>9</td><td>3</td><td>10</td><td>9</td></tr><tr><th>3</th><td>5</td><td>1</td><td>4</td><td>3</td><td>10</td><td>5</td><td>1</td><td>10</td><td>5</td><td>9</td></tr><tr><th>4</th><td>2</td><td>9</td><td>2</td><td>2</td><td>5</td><td>7</td><td>7</td><td>9</td><td>9</td><td>5</td></tr><tr><th>5</th><td>4</td><td>8</td><td>4</td><td>10</td><td>8</td><td>5</td><td>1</td><td>2</td><td>1</td><td>10</td></tr><tr><th>6</th><td>8</td><td>6</td><td>6</td><td>8</td><td>3</td><td>3</td><td>3</td><td>6</td><td>8</td><td>6</td></tr></tbody></table>"
      ],
      "text/plain": [
       "6×10 DataFrames.DataFrame\n",
       "│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │ x8 │ x9 │ x10 │\n",
       "├─────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼─────┤\n",
       "│ 1   │ 8  │ 6  │ 1  │ 2  │ 7  │ 10 │ 5  │ 1  │ 5  │ 10  │\n",
       "│ 2   │ 8  │ 9  │ 6  │ 6  │ 10 │ 4  │ 9  │ 3  │ 10 │ 9   │\n",
       "│ 3   │ 5  │ 1  │ 4  │ 3  │ 10 │ 5  │ 1  │ 10 │ 5  │ 9   │\n",
       "│ 4   │ 2  │ 9  │ 2  │ 2  │ 5  │ 7  │ 7  │ 9  │ 9  │ 5   │\n",
       "│ 5   │ 4  │ 8  │ 4  │ 10 │ 8  │ 5  │ 1  │ 2  │ 1  │ 10  │\n",
       "│ 6   │ 8  │ 6  │ 6  │ 8  │ 3  │ 3  │ 3  │ 6  │ 8  │ 6   │"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "head(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and `tail` to see its bottom rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>x1</th><th>x2</th><th>x3</th><th>x4</th><th>x5</th><th>x6</th><th>x7</th><th>x8</th><th>x9</th><th>x10</th></tr></thead><tbody><tr><th>1</th><td>1</td><td>10</td><td>5</td><td>7</td><td>8</td><td>6</td><td>1</td><td>2</td><td>3</td><td>6</td></tr><tr><th>2</th><td>1</td><td>1</td><td>2</td><td>7</td><td>9</td><td>7</td><td>3</td><td>3</td><td>3</td><td>3</td></tr><tr><th>3</th><td>4</td><td>6</td><td>1</td><td>2</td><td>1</td><td>1</td><td>4</td><td>7</td><td>9</td><td>4</td></tr></tbody></table>"
      ],
      "text/plain": [
       "3×10 DataFrames.DataFrame\n",
       "│ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │ x7 │ x8 │ x9 │ x10 │\n",
       "├─────┼────┼────┼────┼────┼────┼────┼────┼────┼────┼─────┤\n",
       "│ 1   │ 1  │ 10 │ 5  │ 7  │ 8  │ 6  │ 1  │ 2  │ 3  │ 6   │\n",
       "│ 2   │ 1  │ 1  │ 2  │ 7  │ 9  │ 7  │ 3  │ 3  │ 3  │ 3   │\n",
       "│ 3   │ 4  │ 6  │ 1  │ 2  │ 1  │ 1  │ 4  │ 7  │ 9  │ 4   │"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tail(y, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Most elementary get and set operations\n",
    "\n",
    "Given the `DataFrame`, `x`, here are three ways to grab one of its columns as a `Vector`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "([1, 2], [1, 2], [1, 2])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[1], x[:A], x[:, 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To grab one row as a DataFrame, we can index as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr></thead><tbody><tr><th>1</th><td>1</td><td>1.0</td><td>a</td></tr></tbody></table>"
      ],
      "text/plain": [
       "1×3 DataFrames.DataFrame\n",
       "│ Row │ A │ B   │ C │\n",
       "├─────┼───┼─────┼───┤\n",
       "│ 1   │ 1 │ 1.0 │ a │"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[1, :]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can grab a single cell or element with the same syntax to grab an element of an array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[1, 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assignment can be done in ranges to a scalar,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr></thead><tbody><tr><th>1</th><td>1</td><td>1.0</td><td>a</td></tr><tr><th>2</th><td>1</td><td>1.0</td><td>b</td></tr></tbody></table>"
      ],
      "text/plain": [
       "2×3 DataFrames.DataFrame\n",
       "│ Row │ A │ B   │ C │\n",
       "├─────┼───┼─────┼───┤\n",
       "│ 1   │ 1 │ 1.0 │ a │\n",
       "│ 2   │ 1 │ 1.0 │ b │"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[1:2, 1:2] = 1\n",
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "to a vector of length equal to the number of assigned rows,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr></thead><tbody><tr><th>1</th><td>1</td><td>1.0</td><td>a</td></tr><tr><th>2</th><td>2</td><td>2.0</td><td>b</td></tr></tbody></table>"
      ],
      "text/plain": [
       "2×3 DataFrames.DataFrame\n",
       "│ Row │ A │ B   │ C │\n",
       "├─────┼───┼─────┼───┤\n",
       "│ 1   │ 1 │ 1.0 │ a │\n",
       "│ 2   │ 2 │ 2.0 │ b │"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[1:2, 1:2] = [1,2]\n",
    "x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or to another data frame of matching size."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table class=\"data-frame\"><thead><tr><th></th><th>A</th><th>B</th><th>C</th></tr></thead><tbody><tr><th>1</th><td>5</td><td>6.0</td><td>a</td></tr><tr><th>2</th><td>7</td><td>8.0</td><td>b</td></tr></tbody></table>"
      ],
      "text/plain": [
       "2×3 DataFrames.DataFrame\n",
       "│ Row │ A │ B   │ C │\n",
       "├─────┼───┼─────┼───┤\n",
       "│ 1   │ 5 │ 6.0 │ a │\n",
       "│ 2   │ 7 │ 8.0 │ b │"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x[1:2, 1:2] = DataFrame([5 6; 7 8])\n",
    "x"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Julia 0.6.0",
   "language": "julia",
   "name": "julia-0.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
